A Supervised Visual Wrapper Generator for Web-Data Extraction
Abstract
Extracting data from Web pages using wrappers is a fundamental problem that arises in a wide variety of applications of practical interest. In this paper, we propose a novel schema-guided approach to wrapper generation. We provide a user-friendly interface that allows users to define the schema of the data to be extracted and to specify mappings from an HTML page to the target schema. Based on the mappings, the system automatically generates an extraction rule to extract data from the page. Our approach can significantly reduce the manual effort involved in wrapper generation: the user never has to deal with the internal extraction rule, or even be familiar with the details of HTML.
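The schema-guided idea in the abstract can be illustrated with a minimal sketch. It is not the paper's system or its rule language: the sample page, the schema, the mapping format, and the functions `generate_rule` and `apply_rule` are all hypothetical, and the page is assumed to be well-formed XHTML so that Python's standard `xml.etree.ElementTree` can parse it. In the system the abstract describes, the mappings would come from the visual interface rather than being typed by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; assumed well-formed so ElementTree can parse it.
PAGE = """<html><body><table>
<tr><td>Dune</td><td>9.99</td></tr>
<tr><td>Emma</td><td>7.50</td></tr>
</table></body></html>"""

# Target schema defined by the user: the fields of one record.
SCHEMA = ["title", "price"]

# Mappings from page elements to schema fields (hypothetical format).
RECORD_PATH = ".//tr"                               # one record per table row
FIELD_PATHS = {"title": "td[1]", "price": "td[2]"}  # paths relative to a record


def generate_rule(schema, record_path, field_paths):
    """Combine the schema and the mappings into a single extraction rule."""
    unmapped = [f for f in schema if f not in field_paths]
    if unmapped:
        raise ValueError(f"unmapped schema fields: {unmapped}")
    return {"record_path": record_path,
            "fields": {f: field_paths[f] for f in schema}}


def apply_rule(rule, page):
    """Run an extraction rule against a page and return one dict per record."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(rule["record_path"]):
        record = {}
        for field, path in rule["fields"].items():
            cell = node.find(path)
            text = cell.text if cell is not None else None
            record[field] = text.strip() if text else None
        records.append(record)
    return records


rule = generate_rule(SCHEMA, RECORD_PATH, FIELD_PATHS)
records = apply_rule(rule, PAGE)
print(records)
```

The point of the split is that the user only supplies the schema and the mappings; the rule itself is produced and applied by the program, which mirrors the claim that the user never handles the internal extraction rule directly.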
Similar resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we rely mainly on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Supervised Wrapper Generation with Lixto
We illustrate basic features of the Lixto wrapper generator such as the user and system interaction, the capacious visual interface, the marking and selecting procedures, and the extraction tasks by describing the construction of a simple example program in the current Lixto prototype.
Annotating the Legacy Web with Lixto
Introduction The Semantic Web is still a vision. The unstructured Web of today contains millions of documents which cannot be queried and where layout and structure are heavily mixed. Moreover, they are not annotated at all. There is a huge gap between Web information and the qualified, structured data as required in corporate information systems. According to the vision of the Semantic Web, al...
Visual Web Information Extraction with Lixto
We present new techniques for supervised wrapper generation and automated web information extraction, and a system called Lixto implementing these techniques. Our system can generate wrappers which translate relevant pieces of HTML pages into XML. Lixto, of which a working prototype has been implemented, assists the user to semi-automatically create wrapper programs by providing a fully visual ...
Fundamentals Formal Foundations and Semantics of Data Extraction
SYNONYMS web data extraction toolkit, web information extraction system, wrapper generator, wrapper generator toolkit, web macros, web scraper. DEFINITION A web data extraction system is a software system that automatically and repeatedly extracts data from web pages with changing content and delivers the extracted data to a database or some other application. The task of web data extraction pe...
Publication date: 2003